Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that comballoc.ml does not reverse the order of allocations #1917

Merged
merged 2 commits into from Nov 14, 2018

Conversation

stedolan
Copy link
Contributor

@stedolan stedolan commented Jul 19, 2018

Since their fields are provided at creation time, immutable blocks point only to older blocks. The current GC does not rely on this invariant, but multicore does.

Since the minor heap is filled from higher addresses to lower (allocation is subtraction from the heap pointer), this means that immutable blocks will point to higher addresses. Specifically, multicore relies on the invariant that all fields in the minor heap containing pointers:

  • point to the major heap, or
  • point to a higher address in the minor heap, or
  • were created by caml_modify

This invariant is broken by comballoc, which combines multiple allocations into one, since the current implementation of comballoc actually reverses the allocation order, which caused a nasty bug in multicore (ocaml-multicore/ocaml-multicore#131).

This patch makes comballoc preserve the order of allocation. Here's the diff in the -dlinear output of fun x -> 1 :: 2 :: 3 :: x:

 camlComb__f_1199: {comb.ml:1,6-26}
   x/29[%rbx] := R/0[%rax]
   {x/29[%rbx]*}
   V/30[%rax] := alloc 72 {comb.ml:1,20-26}
+  V/30[%rax] := V/30[%rax] + 48 {comb.ml:1,20-26}
   [V/30[%rax] + -8] := 2048 (init)
   [V/30[%rax]] := 7 (init)
   val[V/30[%rax] + 8] := x/29[%rbx] (init)
-  V/31[%rbx] := V/30[%rax] + 24
+  V/31[%rbx] := V/30[%rax] + -24 {comb.ml:1,15-26}
   [V/31[%rbx] + -8] := 2048 (init)
   [V/31[%rbx]] := 5 (init)
   val[V/31[%rbx] + 8] := V/30[%rax] (init)
-  V/32[%rax] := V/30[%rax] + 48
+  V/32[%rax] := V/31[%rbx] + -24 {comb.ml:1,10-26}
   [V/32[%rax] + -8] := 2048 (init)
   [V/32[%rax]] := 3 (init)
   val[V/32[%rax] + 8] := V/31[%rbx] (init)

The other difference visible above is that the new version uses instr_cons_debug instead of instr_cons, so preserves more location information.

@gasche
Copy link
Member

gasche commented Jul 20, 2018

As someone who doesn't know much about the backend, I am a bit perplexed by your description.

I don't understand how the -dlinear output, before the patch, would create a pointer into a younger object. It seems that the pointers are created in the instructions of the form val[V/(N+1) + 8] := V/N, and they write the address of the already-initialized (older) N-th block into the cdr field of the (N+1)-th block.

The main difference that I see is that you fill the blocks from the right to the left (where "right" is the larger adresses), instead of filling from the left to the right. I wondered if that directed was related to creation time, but with a linear model of the minor heap, I would rather assume that non-combined allocations are created in increasing address order, giving a layout closer to the previous version that the newly proposed one.

@stedolan
Copy link
Contributor Author

stedolan commented Jul 20, 2018

This was confusingly worded! I've updated the description, and hopefully it makes more sense now.

The main difference that I see is that you fill the blocks from the right to the left (where "right" is the larger adresses), instead of filling from the left to the right.

Yes, this is the point of the patch. The desired invariant is roughly "immutable pointers point right".

I wondered if that directed was related to creation time, but with a linear model of the minor heap, I would rather assume that non-combined allocations are created in increasing address order, giving a layout closer to the previous version that the newly proposed one.

This is actually false. On the minor heap, allocations are created in decreasing address order, and allocation is implemented as a subtraction.

@gasche
Copy link
Member

gasche commented Jul 20, 2018

That clarifies things, thanks!

Now that I understand the code better, I have minor questions on the differences in code generation.

The most obvious difference is that you have one additional arithmetic operation at the first combined allocation, to go the end of the block. I expect the performance cost to be negligible. Regarding code compacity, your are careful to check (if newsz = sz) when there is in fact only one block in total, and not emit there, so it only occurs on multi-block situations, so it should amortize. I'm not worried about that change.

On the other hand, there is a subtler difference in the code generation, which in some sense corresponds to the change from offsets to sizes, in the case where allocstate is a Pending_alloc. With the old code, we would call ourselves recursively with the same reg register that we got in argument; with the new code, you call yourself with i.res.(0) (as in the allocation case). This corresponds to a difference between computing all sub-block addresses by offset from a common base address, or iteratively adding to one sub-address to get the next one.

I wonder if this may cause a performance difference. If I understand correctly, modern processors perform reasoning on data dependencies to run code in a pipelined and parallel way, your new approach introduces much shorter dependencies (each address on the previous one) than the old.

I'm not sure the switch from offsets to sizes is that important to fill the blocks in the opposite order. Couldn't we have kept exactly the same code generation structure as before, but changed the sub-block address computation from reg + ofs to reg + (newsz - ofs), and gotten the same properties? (Maybe that prevents your nice factorization of the "first allocation" and the "above young_wosize" cases?)

@stedolan
Copy link
Contributor Author

I'm not sure the switch from offsets to sizes is that important to fill the blocks in the opposite order. Couldn't we have kept exactly the same code generation structure as before, but changed the sub-block address computation from reg + ofs to reg + (newsz - ofs), and gotten the same properties? (Maybe that prevents your nice factorization of the "first allocation" and the "above young_wosize" cases?)

Honestly, this was just the first way that came to mind. Your approach sounds simpler, I'll try it and see how it looks.

I wonder if this may cause a performance difference. If I understand correctly, modern processors perform reasoning on data dependencies to run code in a pipelined and parallel way, your new approach introduces much shorter dependencies (each address on the previous one) than the old.

You are correct, but the difference will not be noticeable here. On modern x86s, the results of a register/register addition are available on the very next cycle. Due to redundant ALUs, 2-4 independent register/register additions can be dispatched on the same cycle, so there is in principle a performance benefit from avoiding the dependency. However, the x86 code produced by ocamlopt is not tuned to the point where we'd notice the effect of shaving off a fraction of a cycle. (I'm less knowledgeable about other architectures, but generally speaking the benefits of avoiding dependency chains only really show up with operations more time-consuming than a register-register add. BTW, for x86 details, the best references is https://www.agner.org/optimize/)

@mshinwell
Copy link
Contributor

Having spent quite a lot of my working life doing code review, I'd like to make a general observation about the process, which I have mentioned to some people in the past. I think it is relevant to the discussion above and also to other pull requests where I have seen a similar pattern.

I believe that the functions of code review should be to establish correctness and maintainability of the code. I think it's a mistake, albeit an easy one to make, to let code review slip into "but maybe it could have been done this way instead". Everyone writes things differently and, so long as those two criteria (correctness and maintainability) are satisfied, I think it is rarely a good use of time to go down this path however well-meaning the intentions of the reviewer. It causes delays, takes time away from other projects and I wouldn't be surprised if it also runs the risk of net introducing more bugs.

In this particular case, this patch (which I think is a blocker for multicore) has been around since mid-July, and from @gasche 's review and @stedolan 's comments, appears to be satisfactory without being rewritten, tested and thought about again. Can we just merge it?

@gasche
Copy link
Member

gasche commented Sep 26, 2018

I would be willing to follow what @stedolan prefers here (he had and still has the option to say "you know what, let's go with the PR as stated").

I haven't approved the current patch because this code is tricky logic and it takes a non-small effort to convince oneself that it corresponds to the previous one, I haven't done the work yet. Mark, if you are confident that the change is correct, feel free to approve it.

@mshinwell
Copy link
Contributor

I will look over it as well then.

Copy link
Contributor

@mshinwell mshinwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me. OK for merging after a couple of small things addressed, I don't need to review again.

i.arg i.res i.dbg newnext, ofs)
end
| _ ->
let (newnext, newsz) = combine i.next (Pending_alloc(i.res.(0), sz)) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename newsz -> totalsz, to match the comment etc.?

label_after_call_gc = None; }))
i.arg i.res i.dbg newnext, ofs)
end
| _ ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This match should be exhaustive, | No_alloc | Pending_alloc _ ->.

combine i.next (Pending_alloc(reg, ofs + sz)) in
(instr_cons (Iop(Iintop_imm(Iadd, ofs))) [| reg |] i.res newnext,
| Pending_alloc(reg, totalsz)
when totalsz + sz < Config.max_young_wosize * Arch.size_addr ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why is the multiplication by Arch.size_addr here? (Yes, I know it was there before...)

A: It looks like I messed up when adding the inline record field called words. It should be called bytes instead. I will make a pull request to fix this. (This shows the value of inline record fields rather than anonymous constructor arguments, especially when units of measure are concerned that do not have their own types.)

@@ -20,9 +20,9 @@ open Mach
type allocation_state =
No_alloc (* no allocation is pending *)
| Pending_alloc of Reg.t * int (* an allocation is pending *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should maybe use an inline record in order to give a proper name to the "int". In fact, maybe the names could be sufficiently good that the comment can then be removed.

| _ ->
let (newnext, newsz) = combine i.next (Pending_alloc(i.res.(0), sz)) in
let newnext =
if newsz = sz then newnext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be more immediately clear with something like this, having done the rename of newsz:

let required_offset = totalsz - sz in
if required_offset <= 0 then newnext
else instr_cons_debug ... required_offset ...

@stedolan
Copy link
Contributor Author

Apologies that this took so long to get to. I've made the coding style changes @mshinwell suggests.

I had originally intended to try @gasche's suggested approach, but it would take much longer, and realistically I'm unlikely to find the time soon. (I've not been working on this code for some time, so it would take a while to remember the details).

@mshinwell
Copy link
Contributor

This is ok to merge once CI passes.

@mshinwell mshinwell merged commit 63add57 into ocaml:trunk Nov 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants